A theoretical study of the measurement effectiveness
of flexilevel tests

Abstract 

A flexilevel test is found to be inferior to a peaked conventional 
test for measuring examinees in the middle of the ability range, superior 
for examinees at the extremes. Throughout the entire range of ability, a 
flexilevel test is much superior to any conventional test that attempts 
to provide accurate measurement at both extremes.
A THEORETICAL STUDY OF THE MEASUREMENT EFFECTIVENESS
OF FLEXILEVEL TESTS*



A conventional test becomes a flexilevel test when modified so that 
the examinee follows these rules: 

1. Answer first a specified test item of median difficulty.

2. After answering an item correctly, attempt next the easiest
unanswered item of more-than-median difficulty. After
answering an item incorrectly, attempt next the hardest
unanswered item of less-than-median difficulty.

A special answer sheet is used so that the examinee will know whether each
answer is correct or incorrect. If the conventional test contains $N$
items, the examinee taking the flexilevel test will attempt only $n = (N + 1)/2$
of these. A method for implementing flexilevel testing is described by
Lord (1970).

Surprisingly, it appears that number-right scoring is quite effective
for flexilevel tests (Lord, 1970), in spite of the fact that different
examinees answer different sets of items. A worthwhile refinement, used
throughout the research reported here, is to add one-half score point to
the number-right score of each examinee who answered his last-attempted
item incorrectly.

A crucial question is whether flexilevel testing will be too confusing
or too time-consuming for many examinees. Empirical studies are needed to
answer this and other questions of practical effectiveness.

*This work was supported in part by contract N00014-69-C-0017, project
designation NR 150-303, between the Personnel and Training Research Programs
Office, Psychological Sciences Division, Office of Naval Research and
Educational Testing Service. Reproduction in whole or in part is permitted
for any purpose of the United States Government.
Since a theoretical study can be done more quickly and less expensively
than a substantial empirical study, the study reported here was
carried out in order to evaluate various flexilevel tests from a theoreti- 
cal point of view. A further purpose was to try to separate some flexi- 
level designs that are worth trying out empirically from those that are 
altogether inferior to other tests. 

In order to carry out a theoretical investigation of this type, it
is necessary to be able to predict probabilistically how a given examinee
will respond to items different from those already administered.
Consequently, the present results are derived from item characteristic curve
theory (see, for example, Lord, 1968, sections 3-4).

Here we assume the probability $P_i$ that a given examinee will
answer item $i$ correctly depends only on his "ability" level, denoted by
$\theta$, and on certain item parameters: $a_i$ ("discriminating power"), $b_i$
("difficulty"), and $c_i$ ("pseudo chance-score level"). These item
parameters are assumed to have been already determined, to an adequate
approximation, by pretesting.

Conditional Frequency Distribution of Test Score

We can evaluate any given flexilevel test once we can determine
$f(x|\theta)$, the conditional frequency distribution of test scores $x$ for
examinees at ability level $\theta$. Given some mathematical form for the
function $P_i(\theta) \equiv P(\theta; a_i, b_i, c_i)$, the value of $f(x|\theta)$ can be
determined numerically for any specified value of $\theta$ by the recursive method
outlined below.
Assume the $N$ test items to be arranged in order of difficulty, as
measured by the parameter $b_i$. We will choose $N$ to be an odd number.

For present purposes (not for actual test administration) identify the
items by the index $i$, taking on the values $-n + 1, -n + 2, \ldots,
-1, 0, 1, \ldots, n - 2, n - 1$, respectively, when the items are arranged
in order of difficulty. Thus $b_0$ is the median item difficulty.

Consider, for example, the sequence of right (R) and wrong (W)
answers

R W W R W R R R W R

Following the rules given for a flexilevel test, we see that the
corresponding sequence of items answered is

$i = 0, +1, -1, -2, +2, -3, +3, +4, +5, -4, +6$.

The general rule is that if item $i$ is the $v$-th item administered and
item $j$ is the $(v + 1)$-th, then, for flexilevel tests,

either $j = i + 1$ or $j = i - v$ when $i \ge 0$,
either $j = i - 1$ or $j = i + v$ when $i < 0$.
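The sequencing rule can be checked mechanically. The following is a minimal Python sketch, not part of the original study; the function names `flexilevel_sequence` and `number_right_from_next` are hypothetical, and item 0 is assumed to fall under the first branch of the rule because it is always the first item administered.

```python
def flexilevel_sequence(responses):
    """Return the item indices visited for a string of 'R'/'W' answers.

    The returned list has one more entry than there are responses: the
    last entry is the item that would be administered next.
    """
    item = 0          # the first item is always the median item, i = 0
    visited = [item]
    for v, answer in enumerate(responses, start=1):  # item is the v-th administered
        correct = answer == "R"
        if item >= 0:
            # right: next harder item; wrong: hardest unanswered easier item
            item = item + 1 if correct else item - v
        else:
            # right: easiest unanswered harder item; wrong: next easier item
            item = item + v if correct else item - 1
        visited.append(item)
    return visited


def number_right_from_next(j, v):
    """Number-right score on the first v items, given the (v + 1)-th item j."""
    return j if j > 0 else v + j
```

Applied to the response string `"RWWRWRRRWR"`, the sketch reproduces the item sequence given in the text, and the index of the next item recovers the number-right score without inspecting the individual answers.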

In the same context, let $P_{ij|v}$ denote the probability
that item $j$ will be the next item administered after item $i$:

$$
P_{ij|v} =
\begin{cases}
P_i(\theta) & \text{if } j = i + 1 \text{ and } i \ge 0, \text{ or if } j = i + v \text{ and } i < 0, \\
Q_i(\theta) & \text{if } j = i - v \text{ and } i \ge 0, \text{ or if } j = i - 1 \text{ and } i < 0, \\
0 & \text{otherwise,}
\end{cases}
$$

where $Q_i(\theta) = 1 - P_i(\theta)$.

For examinees at ability level $\theta$, let $p_v(i|\theta)$ denote the
probability that item $i$ is the $v$-th item administered. Clearly,

$$
p_{v+1}(j|\theta) = \sum_i p_v(i|\theta)\, P_{ij|v}. \tag{1}
$$

Now, the first item administered ($v = 1$) is always item $i = 0$, so

$$
p_1(i|\theta) =
\begin{cases}
1 & \text{if } i = 0, \\
0 & \text{otherwise.}
\end{cases}
$$

Starting with this fact and with a knowledge of all the $P_i(\theta)$ (determined
from pretest data), equation (1) allows us to compute the values of $p_v(i|\theta)$
for each $i$, for $v = 2, 3, \ldots, n$, and for any specified set of values of
$\theta$.

Now we can make use of a readily verified feature of flexilevel tests.
Again let $j$ represent the $(v + 1)$-th item to be administered. If $j > 0$,
then the number-right score $r$ on the $v$ items already administered is
$r = j$; if $j < 0$, then $r = v + j$.

Thus the frequency distribution of the number-right score $r$ for
examinees at ability level $\theta$ is given by $p_{n+1}(r|\theta)$ for those examinees
who answered correctly the $n$-th (last) item administered, by $p_{n+1}(r - n|\theta)$
for those who answered incorrectly. This frequency distribution can be
computed recursively from (1).

As already noted, the actual score assigned on a flexilevel test is
$x = r$ if the last item is answered correctly, $x = r + \tfrac{1}{2}$ if it is
answered incorrectly. Consequently the conditional distribution of test
scores is

$$
f(x|\theta) =
\begin{cases}
p_{n+1}(x|\theta) & \text{if } x \text{ is an integer,} \\
p_{n+1}(x - n - \tfrac{1}{2}\,|\,\theta) & \text{if } x \text{ is a half-integer.}
\end{cases}
\tag{2}
$$



For any specified test design, this conditional frequency distribution
$f(x|\theta)$ can be computed for $x = \tfrac{1}{2}, 1, 1\tfrac{1}{2}, \ldots, n$ for various values of
$\theta$. Such distributions constitute the totality of possible information
relevant to evaluating the effectiveness of $x$ as a measure of ability.
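The recursion can be carried out in a few lines. The following Python sketch is an illustration under stated assumptions, not the program used in the study: the function name `score_distribution` and the dictionary `p_correct` (mapping each item index $i$ to $P_i(\theta)$ at one fixed $\theta$) are hypothetical, and item 0 is assumed to follow the branch used for non-negative indices.

```python
from collections import defaultdict


def score_distribution(p_correct, n):
    """Return {x: f(x | theta)} for a flexilevel test of n administered items.

    p_correct maps item index i in -(n-1)..(n-1) to P_i(theta), the
    probability of a correct answer at one fixed ability level.
    """
    p_v = {0: 1.0}  # p_1(i | theta): the first item is always item 0
    for v in range(1, n + 1):
        nxt = defaultdict(float)
        for i, prob in p_v.items():
            p = p_correct[i]
            if i >= 0:
                right, wrong = i + 1, i - v
            else:
                right, wrong = i + v, i - 1
            nxt[right] += prob * p          # transition with probability P_i
            nxt[wrong] += prob * (1.0 - p)  # transition with probability Q_i
        p_v = nxt                           # now holds p_{v+1}(j | theta)

    # Convert the index j of the (n + 1)-th item into the assigned score x.
    f = defaultdict(float)
    for j, prob in p_v.items():
        x = float(j) if j > 0 else n + j + 0.5  # half point if last item wrong
        f[x] += prob
    return dict(f)
```

When every item has the same probability of success, the answers are independent and identically distributed, so the result can be compared with binomial probabilities split by whether the last answer was right or wrong; this gives a simple check on the sketch.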

Evaluating a Flexilevel Testing Procedure



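If we are to use $x$ as a measure of ability, we would like $\mu_{x|\theta_1}$ (the
mean of $x$ when $\theta = \theta_1$) to differ from $\mu_{x|\theta_2}$ whenever $\theta_1 \ne \theta_2$. It
seems natural to use the "critical ratio"

$$
\frac{\mu_{x|\theta_2} - \mu_{x|\theta_1}}{\sigma_{x|\theta_1}}
$$

to summarize the effectiveness of $x$ for discriminating between ability
levels $\theta_1$ and $\theta_2 = \theta_1 + \Delta$, where $\sigma_{x|\theta}$ is the conditional standard
deviation of $x$ and $\Delta$ represents a small increment in ability (small
enough so that $\sigma_{x|\theta} = \sigma_{x|\theta+\Delta}$ approximately).

Actually we will work with the square of this ratio:

$$
I_x(\theta) \equiv k\, \frac{(\mu_{x|\theta+\Delta} - \mu_{x|\theta})^2}{\sigma^2_{x|\theta}}, \tag{3}
$$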
where $k$ is any convenient constant. Given some small increment $\Delta$,
$I_x(\theta)$, as a function of $\theta$, is readily computed from (2) for any
specified test design. Since we are only interested in comparisons between
designs, the values of $k$ and $\Delta$ are of no importance so long as they
are the same for all designs compared.
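As a hedged illustration of this computation, the Python sketch below evaluates the squared critical ratio from any score distribution, assuming it takes the form $k(\mu_{x|\theta+\Delta} - \mu_{x|\theta})^2 / \sigma^2_{x|\theta}$. The names `mean_and_variance`, `information`, and `dist_at` are hypothetical; `dist_at(theta)` stands for any routine that returns the conditional distribution of $x$ as a dictionary.

```python
def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {score: prob}."""
    mean = sum(x * p for x, p in dist.items())
    var = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, var


def information(dist_at, theta, delta=0.01, k=1.0):
    """Squared critical ratio between ability levels theta and theta + delta.

    dist_at is a callable returning {score: prob} at a given ability level.
    The values of delta and k are arbitrary but must be held fixed across
    all test designs being compared.
    """
    mean0, var0 = mean_and_variance(dist_at(theta))
    mean1, _ = mean_and_variance(dist_at(theta + delta))
    return k * (mean1 - mean0) ** 2 / var0
```

Because only ratios between designs matter, the defaults for `delta` and `k` are placeholders rather than recommended values.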



Test Designs Studied

The numerical results reported here are obtained on the assumption
that $P_i$ is a normal ogive, possibly modified to accommodate the effects
of success due to guessing:

$$
P_i(\theta) = c_i + (1 - c_i) \int_{-\infty}^{a_i(\theta - b_i)} \varphi(t)\, dt, \tag{4}
$$

where $\varphi(t)$ is the normal density function. The results would presumably
be about the same if $P_i$ had been assumed logistic rather than normal
ogive.

To keep matters simple, we will only consider tests in which all items
have the same discriminating power, $a$; also the same pseudo chance level,
$c$. Results are presented here separately for $c = 0$ (no guessing) and
$c = .2$. The results are general for any value of $a > 0$, since $a$ can
be absorbed into the unit of measurement chosen for the ability scale (as
will be noticed for the base line shown in the figures).

In all tests studied, each examinee answers exactly $n = 60$ items.

For simplicity, we will consider only tests in which the item difficulties
form an arithmetic sequence, so that $b_{i+1} - b_i = d$, say.
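A test design of this kind can be set up as in the Python sketch below, which assumes the item characteristic curve is $c + (1 - c)\,\Phi(a(\theta - b))$ with $\Phi$ the normal distribution function. The function names `normal_ogive` and `arithmetic_difficulties` are hypothetical and are introduced only for illustration.

```python
import math


def normal_ogive(theta, a, b, c=0.0):
    """Probability of a correct answer: c + (1 - c) * Phi(a * (theta - b))."""
    z = a * (theta - b)
    phi_cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal distribution function
    return c + (1.0 - c) * phi_cdf


def arithmetic_difficulties(n, b0, d):
    """Difficulties b_i = b0 + i * d for i = -(n-1), ..., n-1 (N = 2n - 1 items)."""
    return {i: b0 + i * d for i in range(-(n - 1), n)}
```

For example, `arithmetic_difficulties(60, 0.0, 0.033)` yields the 119 difficulties of a 60-item flexilevel design centered at zero, and `normal_ogive(theta, 0.5, b, 0.2)` gives the success probability for one such item when guessing is allowed.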
Results for Tests with No Guessing

Figure 1 compares the effectiveness of four 60-item ($n = 60$, $N = 119$)
flexilevel tests with each other and with three bench mark tests. The
scale chosen for $\theta$ in the figures is such that for typical achievement
and aptitude tests the standard deviation of $\theta$ in typical high school
and college groups will be very roughly $\sigma = 1/2a$ (a more detailed
explanation is given in Lord, 1969).

The "standard test" is a conventional 60-item test composed entirely
of items of difficulty $b = 0$, scored by counting the number of right
answers. There is no guessing, so $c = 0$. The values of $a$ and $c$
are the same for bench mark and flexilevel tests. For fixed $a$ and $c$,
no test composed of dichotomously scored items with characteristic curves
(4) can have a higher value of $I_x(\theta)$ at any $\theta$ than the standard test has
at $\theta = b_0$ (see Birnbaum, 1968).

As would be expected, the figure shows that the standard test is
best for discriminating among examinees at ability levels near $\theta = 0$.

If good discrimination is important at $\theta = \pm 2/2a$ or $\theta = \pm 3/2a$, then
a flexilevel test such as the one with $d = .033/2a$ or $d = .050/2a$ is
better. The larger $d$ is, the poorer the measurement at $\theta = b_0$, but
the better the measurement at extreme values of $\theta$.

Suppose the best possible measurement is required at $\theta = \pm 2$, with
$a = 0.5$. It might be thought that an effective conventional 60-item test
for this limited purpose would consist of 30 items at $b = +2$ and 30
items at $b = -2$. The curve for this last test is shown in Figure 1.
The fact is that with $a = 0.5$, no unpeaked test (i.e., no test with
items at more than one difficulty level) can simultaneously measure as
well at both $\theta = +2$ and $\theta = -2$ as does the standard test (which has
all items peaked at $b = 0$).



The situation is different if the best possible measurement is required
at $\theta = \pm 3$, with $a = 0.5$. Using dichotomously scored items, the best
60-item conventional test for this purpose consists of 30 items at
$b = -2.8$ and 30 items at $b = +2.8$, approximately. The curve for this
test is shown in Figure 1.

For fixed $\theta$, the number-right score $x$ on a standard test has a
binomial distribution. Thus, the expected score is

$$
\mu_{x|\theta} = nP
$$

and the variance of the scores is

$$
\sigma^2_{x|\theta} = nPQ, \tag{5}
$$

where $P \equiv P(\theta)$ is given by (4) and $Q \equiv 1 - P$. It is apparent from (5) that $I_x(\theta)$
for a standard test is proportional to $n$, the test length.

We now see that when $a = 0.5$, the 60-item flexilevel test with
$d = .033$ gives about as effective measurement as a

[illegible]-item standard test at $\theta = 0$,
60-item standard test at $\theta = \pm 1$,
69-item standard test at $\theta = \pm 2$,
86-item standard test at $\theta = \pm 3$.
At $\theta = \pm 3$, the 60-item flexilevel test with $d = .1$ is as effective
as a 96-item standard test.
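The equivalent test lengths quoted above rest on the proportionality between the effectiveness of a standard test and its length. As a hedged worked step, assuming the squared critical ratio has the form $k(\mu_{x|\theta+\Delta} - \mu_{x|\theta})^2 / \sigma^2_{x|\theta}$, the binomial mean $nP$ and variance $nPQ$ of a standard test of $n$ items give

$$
I_x(\theta) = k\, \frac{[\,nP(\theta + \Delta) - nP(\theta)\,]^2}{nP(\theta)Q(\theta)}
= n \cdot k\, \frac{[\,P(\theta + \Delta) - P(\theta)\,]^2}{P(\theta)Q(\theta)}.
$$

On this reading, the length $n^*$ of the standard test that matches a 60-item flexilevel test at a given $\theta$ would be $n^* = 60\, I_{\text{flex}}(\theta) / I_{\text{std}}(\theta)$, where $I_{\text{std}}$ refers to the 60-item standard test. The symbols $n^*$, $I_{\text{flex}}$, and $I_{\text{std}}$ are introduced here only for illustration.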



Results for Tests with Guessing

Figure 2 compares the effectiveness of three 60-item flexilevel tests
with each other and with five bench mark tests. All items have $c = 0.2$
and all have the same discriminating power $a$. The standard test is a
conventional 60-item test with all items at difficulty level $b = 0.5/2a$,
scored by counting the number of right answers.

If all the item difficulties in any test were changed by some constant
amount, the effect would be simply to translate the corresponding
curve by that amount along the $\theta$-axis. The difficulty level of each
bench mark test and the starting item difficulty level $b_0$ of each
flexilevel test in Figure 2 has been chosen so as to give maximum discriminating
power somewhere in the neighborhood of $\theta = 0$.

The standard test is again found to be best for discriminating among
examinees at ability levels near $\theta = 0$. At $\theta = \pm 2$ the flexilevel tests
are better than the standard test, which in turn seems to be better than
any of the other conventional (bench mark) tests, although the situation
is less clear than before because of the asymmetry of the curves.

When $a = 0.5$ the 60-item flexilevel test with $b_0 = -0.9$ and
$d = .033$ gives about as effective measurement as a

58-item standard test at $\theta = 0$,
60-item standard test at $\theta = +1$,
70-item standard test at $\theta = -2.0$ or $\theta = +2.25$,
83-item standard test at $\theta = +3$,
11[digit illegible]-item standard test at $\theta = -3$.

At $\theta = -3$, the 60-item flexilevel test with $b_0 = -1.3$ and $d = .06$[digit illegible]
is as effective as a 137-item standard test.

Conclusion 

Near the middle of the ability range for which the test is designed,
a flexilevel test is less effective than is a comparable peaked
conventional test. In the outlying half of the ability range, the
flexilevel test provides more accurate measurement in typical aptitude and
achievement testing situations than a peaked conventional test composed
of comparable items. This comparison assumes that 60 items are
administered to each examinee. The advantage of flexilevel tests over
conventional tests at low ability levels is significantly greater when there
is guessing than when there is not.

Empirical studies will be needed to answer such questions as the 
following: 

1. To what extent are different types of examinees confused by 
flexilevel testing? 

2. To what extent does flexilevel testing lose efficiency 
because of an increase in testing time per item? 

3. How adequately can we score the examinee who does not
have time to finish the test?

4. How can we score the examinee who does not follow directions?

5. What other serious inconveniences and complications are 
there in flexilevel testing? 
6. Is the examinee's attitude and performance improved when
a flexilevel test "tailors" the test difficulty level to
match his ability level?

Empirical investigations should study tests designed in accordance
with the theory used here. Otherwise, it is likely that a poor choice of
$d$ and especially of $b_0$ will result in an ineffective measuring
instrument.

The most likely application of flexilevel tests is in situations 
where it would otherwise be necessary to unpeak a conventional test in 
an attempt to obtain adequate measurement at the extremes of the ability 
range. Such situations are found in nationwide college admissions testing 
and elsewhere. 